Layout and Language: Challenges for Table Understanding on the Web

Author

  • Matthew Hurst
Abstract

In this paper, we consider the table understanding task and present a catalogue of particular issues that arise when the tables are those found on the web. In addition, we consider what happens when processes commonly associated with web pages are applied to those bearing tables.

1 Table Understanding and the Web

The ubiquity of tables, and their ability to describe relational information in a compact and immediate manner, make them attractive targets for automated understanding. Recent research into the automatic location, recognition and understanding of tables has demonstrated the viability of integrating automated table processing systems into larger knowledge management applications ([8]). However, table understanding is still a relatively novel research area, one whose definition and terminology are still not fixed. It is useful to break the task down into subtasks, and to consider them in turn with respect to the understanding of tables delivered on the web.

Generally, table processing can be conceptualized as consisting of table location; table recognition; functional and structural analysis; and finally interpretation, that is, the extraction of meaningful and unambiguously structured information ([4]). We concentrate on the first two tasks in the following.

Location: table location is the process of spotting tables in documents. Traditionally, this task comes in two basic forms: document image sourced tables ([7], [3]) and electronic text sourced tables, including HTML ([1]). The problem extends to the spotting of tables in other document encodings such as PostScript, PDF, RTF, Word, etc. In general, when considering tables on the web, the appropriate HTML tags are exploited (TABLE, TH, TD, etc.). However, this is where we come to the first two distinguishing points:
  • The presence of the TABLE tag in an HTML document does not necessarily indicate the presence of a table ([1] suggest that less than 30% of HTML TABLEs are real tables in one particular domain).
  • There are many other ways in which tables may be presented in web-delivered documents: plain text (PRE), images, and mixtures of table-specific tags (TABLE, etc.) with tags used within the table for their ability to place text spatially (PRE, LI, etc.); see Figure 1 for an example of such complexities.

The first point requires the creation of accurate classification technology. Given any TABLE node in the HTML, the classifier must accept or reject it. Such a classifier may be built either via hand-crafted rules ([1]) or using a machine learning approach. Experiments suggest that a machine learning approach using a naive Bayes classifier ([9]), based on a feature set describing the set of tags below the potential TABLE node in the document tree, produces adequate results.

Locating tables encoded in other formats requires technology from other areas. For example, images of tables may be processed by techniques from the document image field ([2]), and pre-formatted tables (using the PRE tag) may be processed using plain text table methods ([5]). However, the classification problem extends to these cases, and individual classifiers must be constructed to make decisions about document elements of each type.

The remaining outstanding issues relate to the mixture of encoding types (e.g. tables built out of TABLE nodes and pre-formatted elements), as well as the mixture of encoding purposes (e.g. the use of the HTML TABLE to encode surrounding text as well as an embedded table).

Figure 1. A web page using a mixture of HTML tables (on the left) and images of tables (on the right).

Recognition: table recognition is the task of segmenting the original description of the table into a relative spatial description.
In general this task is required when the input is low-level, such as a document image or an electronic text. Clearly, if such tables are found on a web page, the same process is required. Again, given certain assumptions, we can take the marked-up tables in a web page to be the logical spatial table. However, there are certain issues that need to be understood in order to account for certain variations:

  • internal cell structure: though tags like TH and TD may be assumed to delimit a single cell in the table, there are cases where other non-table tags are used to provide internal structure in such a way as to associate the cell's contents with those of other cells. A solution would need to apply a certain amount of recursive processing, working into the structure and building a unified abstract table.
  • split cells: in order to gain more control over the distribution of the text in a cell, authors occasionally split the text and place it in two or more adjacent cells. This problem may be accommodated by exploiting linguistic processing as described in [6], where the content of the cell can be used to indicate continuity, if any, with other cells.
  • errors: spanning errors occur when the COLSPAN or ROWSPAN values are not correctly calculated. There are two cases. In the first, the cell spans beyond the border of the intended table, giving the cell incorrect coordinates. In the second, the span of the cell does not communicate the correct meaning of the cell. For example, a cell that is intended to span three cells below it spans only one, leading to ambiguity. The first type of problem may be repaired by some form of normalization, whereas the second requires intelligent processing in order to distinguish the following two cases:
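As an aside on the first type of spanning error, the normalization repair might look something like the sketch below. It is only an illustration under stated assumptions: the `Cell` record and the `normalize_spans` function are hypothetical names, and the heuristic that the intended table width equals the most common row width is an assumption of this sketch, not something the paper specifies.

```python
from collections import Counter
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Cell:
    """Hypothetical cell record: text plus the declared COLSPAN/ROWSPAN."""
    text: str
    colspan: int = 1
    rowspan: int = 1


def normalize_spans(rows):
    """Clip spans that run past the right or bottom border of the table.

    Assumption (not from the paper): the intended width is the most common
    row width, where a row's width is the sum of its cells' colspans.
    Columns occupied by rowspans from earlier rows are ignored for brevity.
    """
    if not rows:
        return []
    n_rows = len(rows)
    widths = [sum(cell.colspan for cell in row) for row in rows]
    target_width = Counter(widths).most_common(1)[0][0]

    fixed = []
    for r, row in enumerate(rows):
        col = 0
        new_row = []
        for cell in row:
            # Keep at least one column and one row so no cell disappears.
            colspan = max(1, min(cell.colspan, target_width - col))
            rowspan = max(1, min(cell.rowspan, n_rows - r))
            new_row.append(replace(cell, colspan=colspan, rowspan=rowspan))
            col += colspan
        fixed.append(new_row)
    return fixed


table = [
    [Cell("Results", colspan=5)],                      # overshoots a 3-column table
    [Cell("Year"), Cell("Sales"), Cell("Profit")],
    [Cell("2000"), Cell("10"), Cell("3", rowspan=4)],  # overshoots the last row
]
for row in normalize_spans(table):
    print([(c.text, c.colspan, c.rowspan) for c in row])
```

Clipping of this kind only addresses spans that leave the table. It does nothing for the second type of error, where the span stays inside the table but conveys the wrong meaning.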

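Returning to the location step, the accept-or-reject decision for a TABLE node can be illustrated with a small naive Bayes classifier over the tags found below the node. This is a minimal sketch, not the system evaluated in the paper: the names `TagCollector`, `tag_features` and `TagNaiveBayes` are hypothetical, the Bernoulli (tag present or absent) model with add-one smoothing is one possible reading of "a feature set describing the set of tags", and the four training fragments are toy data invented for the example.

```python
import math
from collections import Counter
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Record the name of every start tag in an HTML fragment."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)  # html.parser lowercases tag names


def tag_features(table_html):
    """Set of tag names below the outer TABLE node.

    Assumes the fragment starts with the candidate <table> tag itself.
    """
    collector = TagCollector()
    collector.feed(table_html)
    collector.close()
    return set(collector.tags[1:])  # drop the outer table tag


class TagNaiveBayes:
    """Bernoulli naive Bayes over tag presence, with add-one smoothing."""

    def fit(self, examples):
        # examples: list of (feature_set, label) pairs
        self.vocab = set().union(*(feats for feats, _ in examples))
        self.class_counts = Counter(label for _, label in examples)
        self.tag_counts = {label: Counter() for label in self.class_counts}
        for feats, label in examples:
            self.tag_counts[label].update(feats)
        return self

    def log_score(self, feats, label):
        n = self.class_counts[label]
        score = math.log(n / sum(self.class_counts.values()))
        for tag in self.vocab:
            p = (self.tag_counts[label][tag] + 1) / (n + 2)
            score += math.log(p if tag in feats else 1 - p)
        return score

    def predict(self, feats):
        return max(self.class_counts, key=lambda c: self.log_score(feats, c))


# Toy training data, invented for illustration only.
training = [
    ("<table><tr><th>Year</th><th>Sales</th></tr>"
     "<tr><td>2000</td><td>10</td></tr></table>", "table"),
    ("<table><caption>Results</caption><tr><th>A</th></tr>"
     "<tr><td>1</td></tr></table>", "table"),
    ("<table><tr><td><img src='logo.gif'><a href='/'>Home</a></td></tr></table>",
     "layout"),
    ("<table><tr><td><form><input name='q'></form></td>"
     "<td><p>Text</p></td></tr></table>", "layout"),
]
model = TagNaiveBayes().fit([(tag_features(h), y) for h, y in training])

candidate = ("<table><tr><th>Name</th><th>Age</th></tr>"
             "<tr><td>Ann</td><td>30</td></tr></table>")
print(model.predict(tag_features(candidate)))
```

With realistic data the feature set would probably need more than tag presence (for example tag counts or depth), but the structure of the decision stays the same: extract features from the subtree, then compare class scores.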
Publication date: 2001